Abstract

In this paper we describe a new efficient (in fact optimal) data structure for the top-K color problem. Each element of an array $A$ is assigned a color $c$ with priority $p(c)$. For a query range $[a, b]$ and a value $K$, we have to report $K$ colors with the highest priorities among all colors that occur in $A[a..b]$, sorted in reverse order by their priorities. We show that such queries can be answered in $O(K)$ time using an $O(N \log \sigma)$ bits data structure, where $N$ is the number of elements in the array and $\sigma$ is the number of colors. Thus our data structure is asymptotically optimal with respect to the worst-case query time and space. As an immediate application of our results, we obtain optimal time solutions for several document retrieval problems. The method of the paper could also be of independent interest.
1 Introduction

In this paper we study a variant of the well-known color reporting problem. Each entry of an array $A$ is assigned a color $c \in C$ with priority $p(c)$. For a query range $Q = [a, b]$ and an integer $K$, the data structure reports $K$ distinct colors with highest priorities among all colors that occur in $Q$. Colors are reported in the reverse order of their priorities. In the online version of this problem, we report all colors that occur in $A[a..b]$ in decreasing order until all colors are reported or the query is terminated by the user.
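
To make the query semantics concrete, the following is a brute-force reference sketch, not the data structure of this paper. It assumes 0-based inclusive indices, distinct priorities, and a hypothetical dictionary `priority` mapping each color to $p(c)$; it takes time linear in the range length, whereas the structure described later answers the same query in $O(K)$ time.

```python
def top_k_colors(A, priority, a, b, K):
    """Return up to K distinct colors of A[a..b] (0-based, inclusive),
    ordered by decreasing priority. Brute force, for reference only."""
    distinct = set(A[a:b + 1])
    ranked = sorted(distinct, key=lambda c: priority[c], reverse=True)
    return ranked[:K]


# Illustrative data (hypothetical colors and priorities).
A = ["r", "g", "r", "b", "g", "y"]
priority = {"r": 3, "g": 1, "b": 4, "y": 2}
print(top_k_colors(A, priority, 1, 4, 2))
```

A color that occurs several times in the range, such as `"g"` in `A[1..4]`, is reported at most once.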

Using an $O(N \log \sigma)$ bits data structure, we can answer such queries in $O(K)$ time, where $K$ is the number of reported colors and $\sigma$ is the total number of distinct colors in $A$. Thus our data structure achieves worst-case optimal query time and space usage. Even for the simpler problem of reporting all distinct colors in $A[a..b]$ in arbitrary order, the best previously known optimal time data structure uses $O(N \log N)$ bits.

The study of this problem is motivated by its applications to document retrieval and search engines. It is known [15] that we can report all documents that contain a pattern $P$ by reporting all distinct colors that occur in a range $A[a_P..b_P]$ of the document array. In many cases, we want to output only the most important or most relevant documents in sorted order, starting with the most important (relevant) ones. Search engines are a well-known example of such a scenario: an answer to a query is a sequence of documents output in the reverse order of their relevance. Static ranking of documents based on, e.g., their links with other documents, such as PageRank [17] and HITS [13], is an important part of estimating document relevance. Thus it may be beneficial to generate the list of the $K$ most highly ranked documents that contain a specified pattern $P$, sorted by their ranks. The parameter $K$ is sometimes not known in advance, and documents must be reported in order of their ranks until the procedure is terminated. In such situations our data structure gives us an optimal time solution. Our result can also be applied to other document retrieval problems.

Previous and Related Work. Colored range searching is a widely studied problem with various applications. In computational geometry and data structures, the following variant of the problem is considered. A set of points is stored in a data structure, so that for any rectangle $Q$ the distinct colors of all points in $Q$ must be reported or the number of distinct colors must be counted. Such queries can be supported efficiently for $d \le 3$ dimensions [12], [10], [1]. Several related problems, in which distinct colors of geometric objects must be reported or counted, were also studied.

In the document listing problem, a set of documents $d_1, \ldots, d_s$ with total length $N$ must be stored in a data structure, so that for any pattern $P$ all documents that contain $P$ must be reported. The total number of occurrences of $P$ may significantly exceed the number of documents that contain $P$. Matias et al. [14] described the first data structure for this problem; their data structure answers document listing queries in $O(|P| \log s + docc)$ time, where $|P|$ is the length of $P$ and $docc$ is the number of reported documents. Muthukrishnan [15] showed that several document retrieval problems can be reduced to colored searching problems. In [15] the author describes an $O(N \log N)$ bits data structure that answers document listing queries in optimal $O(|P| + docc)$ time. The data structures of [18], [20] further improve the space usage by storing the documents in compressed form; however, their solutions do not achieve optimal query time: it takes $O(\log^{\varepsilon} N)$ time [18] or $O(\log s)$ time [20] to report each document. The solution of Gagie et al. [8], based on the wavelet tree, uses $N \log s$ bits but also needs suboptimal $O(|P| + docc \log s)$ time to answer a query.

The total number of documents that contain $P$ can be very large and we may be interested in reporting only a subset of documents that contain the pattern $P$. In [15], two such problems are considered. In the $K$-mine problem, documents that contain at least $K$ occurrences of pattern $P$ must be reported. In the $K$-repeats problem, we report all documents $d$ such that the minimal distance between two occurrences of the pattern $P$ in $d$ is at most $K$. In [15], $O(N \log N)$ bit data structures that solve both problems in $O(|P| + docc)$ time are described.

Instead of reporting all documents whose relevance score exceeds a certain threshold, we often want to report the $K$ most important or most relevant documents in sorted order. Recently, Hon et al. [11] addressed this problem and described an efficient framework for reporting the $K$ most relevant documents with respect to the query pattern $P$. Their data structure uses linear space (i.e., $O(N \log N)$ bits) and can report the $K$ most relevant documents in $O(|P| + K \log K)$ time. They also describe a compressed data structure that supports queries in $O(|P| + K\,\mathrm{polylog}(N))$ time. In addition to static document ranks, the framework of [11] also supports other relevance metrics.

The problem of storing an array $A$, so that for any $a < b$ all elements in $A[a]A[a+1] \ldots A[b]$ can be output in sorted order, was studied by Brodal et al. [5]. In [5] the authors obtained an $O(N \log N)$ bits and optimal $O(b - a + 1)$ time solution for this problem. We observe that in that paper a different problem is studied: if array $A$ contains colors and some color $c$ occurs $n_c$ times in $A[a]A[a+1] \ldots A[b]$, then the data structure of [5] reports this color $n_c$ times. Our data structure returns the color $c$ only once in this situation. The problem of ranked reporting was also considered by Grossi and Bialynicka-Birula [4]. They describe a general technique for adding rank information to geometric objects so that answers to range reporting queries can be ordered by rank. However, any data structure based on their method uses super-linear space and requires poly-logarithmic time to answer queries. For instance, a reduction of color queries to three-sided queries and their method result in an $O(N \log^{1+\varepsilon} N)$ bit data structure that answers queries in $O(\log N + K)$ time.



Our Results. We develop a new explicit technique for recursive, exponentially decreasing size subarrays combined with a new method for storing certain pre-defined query answers. We show that an array $A$ can be stored in an $O(N \log \sigma)$ bits data structure so that for any two indexes $a < b$ and for any integer $K$, the $K$ distinct colors with highest priorities among all colors that occur in $A[a..b]$ can be reported in optimal $O(K)$ time. In fact, it is not necessary to know $K$ in advance: we can report colors that occur in $A[a..b]$ in the reverse order of their priorities until all colors are reported or the procedure is terminated by the user. Our method depends on transforming a data structure with $O(N^{1/f} + K)$, $f > 1$, query time into a data structure with optimal query time; two crucial components of this transformation are an efficient method for obtaining solutions for pre-defined intervals and recursively defined data structures with an exponentially decreasing number of elements.

Our data structure leads to optimal time solutions for document listing in situations when every 
document is assigned a static rank. 

Problem 1 (Ranked Document Listing Problem) Documents $d_1, \ldots, d_s$ are stored in a data structure, so that for any pattern $P$ and any $K$ we must return the $K$ most highly ranked documents that contain $P$, ordered by their rank.

The data structure of [11] uses $O(N \log N)$ bits and solves this problem in $O(|P| + K \log K)$ time, where $N$ is the total length of all documents. The compressed data structure of [11] uses $|CSA| + o(N) + s \log(N/s)$ bits, but requires $O(|P| + K \log^{3+\varepsilon} N)$ time to answer a query, where $|CSA|$ denotes the number of bits necessary to store the compressed suffix array for all documents. We can solve the ranked document listing problem in optimal $O(|P| + K)$ time using worst-case optimal $O(N \log s)$ bits of space (in addition to the suffix array). Even for the general document listing problem, the previous optimal time data structure [15] needs $O(N \log N)$ additional bits of space.

Problem 2 (Ranked t-Mine Problem) Documents $d_1, \ldots, d_s$ are stored in a data structure, so that for any pattern $P$ and any $K$, $t$ we must return the $K$ most highly ranked documents that contain $P$ at least $t$ times, ordered by their rank.

We can solve the ranked t-mine problem in $O(|P| + K)$ time by a data structure that uses $O(N \log s)$ words of $\log N$ bits.

We can also combine our data structure with the framework of [11] and use a number of other relevance metrics. Let $S(d, P)$ denote the set of all positions in a document $d$ where $P$ matches. The framework of [11] supports relevance metrics that depend on $S(d, P)$. We will denote such a metric by $rel(d, P)$. Examples of metrics $rel(d, P)$ are freq, the frequency of occurrence of $P$ in a document, and mindist, the minimal distance between two occurrences of $P$ in a document.

Problem 3 (Most Relevant Documents Problem) Documents $d_1, \ldots, d_s$ are stored in a data structure, so that for any pattern $P$ and any $K$ we must return the $K$ most relevant documents with respect to a metric $rel(d, P)$, ordered by $rel(d, P)$.

The $O(N \log N)$ bit data structure of [11] supports most relevant documents queries in $O(|P| + K \log K)$ time. For some relevance metrics, the compressed data structure of [11] uses $2|CSA| + s \log(N/s) + o(N)$ bits, but needs $O(|P| + K\,\mathrm{polylog}(N))$ time to answer queries. For instance, if freq is chosen as the relevance metric, then queries can be answered in $O(|P| + K \log^{4+\varepsilon} N)$ time. We show that using a linear space data structure, we can report $K$ documents that contain a pattern $P$ and are most relevant with respect to $P$ in $O(|P| + K \log |P|)$ time. This is an improvement over the first result of [11] for the case when $|P| = o(K)$. Moreover, if |P| = log ^ ' N, then our data structure supports most relevant document queries in optimal $O(|P| + K)$ time. For instance, suppose that freq is used as the relevance metric. Then for any pattern $P$ such that |P| = log^ ' N, we can report the $K$ documents in which $P$ occurs most frequently in optimal $O(|P| + K)$ time.

Table 1: Overview of results in the RAM model

Overview. In section 2 we recall the results for standard one-dimensional color reporting and counting problems. In section 3 we present a simple data structure that finds the (unsorted) list of $K$ colors with highest priorities in $O(K + \log N)$ time and uses $O(N \log N)$ bits of space. Essentially, our data structure is a wavelet tree [9] with secondary structures for color reporting and counting stored in its nodes.

In sections 4 and 5 we describe a new approach that enables us to achieve optimal time and almost optimal ($O(N \log N)$ bits) space. Our first idea, described in section 4, is sparsification: we store data structures only for nodes situated on a constant number of levels. This allows us to achieve linear space because each element is stored in only a constant number of data structures. On the other hand, our new search procedure must visit a much larger number of nodes; therefore, the search time grows to $O(N^{1/f} + K)$ for a constant $f$. In section 5 we show how the search time can be decreased without increasing space. First, we describe how we can obtain solutions for some pre-defined queries using linear space. We recursively combine this method with the data structures of section 4. In section 6 we demonstrate that the space usage can be further reduced to $O(N \log \sigma)$ bits; see Table 1. Besides that, our data structure can also be extended to the external memory model, as shown in section 7. Applications of our data structures to document retrieval problems are described in section 8.

Throughout this paper $A[i..j]$ denotes the subarray that consists of elements $A[i]A[i+1] \ldots A[j]$; $[a, b]$ denotes an interval that consists of all integers $x$, $a \le x \le b$. For simplicity, we sometimes do not distinguish between elements and their colors. Our data structures use only additions, subtractions, and standard bit operations. We say that a data structure with $N$ elements uses linear space if it can be stored in $O(N \log N)$ bits.



2 Colored Reporting and Counting 



In the color reporting problem, each element of an array $A$ is assigned a color $c$ from the set of colors $C$. Given a query range $[a, b]$, we must report all distinct colors $c_1, \ldots, c_K$ such that at least one element colored with $c_i$, $1 \le i \le K$, occurs in $A[a..b]$. In the color counting problem, we must count the number of distinct colors that occur in $A[a..b]$. Both problems were studied extensively; we refer the reader to [10] for a survey of results.^1

^1 Definitions of colored reporting and counting used in this paper are slightly more restrictive than the standard definitions of this problem.



Lemma 1 In the RAM model, colored range reporting queries can be answered in $O(K)$ time using an $O(N \log N)$ bits data structure. In the RAM model, the colored range counting problem can be solved in $O(\log N)$ time using an $O(N \log N)$ bits data structure.

Proof: As shown in [10], one-dimensional colored reporting (counting) for an array with $N$ elements can be reduced to standard three-sided reporting (resp. counting) on the $N \times N$ grid, i.e., to the problem of storing a set of two-dimensional points whose coordinates belong to the integer interval $[1, N]$ in a data structure, so that all points that belong to a query range of the form $[x_1, x_2] \times [y_1, +\infty)$ can be reported (counted). Three-sided reporting queries on the $N \times N$ grid can be answered in $O(K)$ time in the RAM model using an $O(N \log N)$ bits data structure [3], [15]. Three-sided counting queries can be answered in $O(\log N)$ time using a linear space data structure [6]. □
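
One common way to realize such a reduction is through previous-occurrence pointers; the sketch below is an illustration under that assumption, and the exact formulation in [10] may differ (for example, using next occurrences, which gives the symmetric query shape). Each position $i$ becomes the point $(i, prev[i])$, and a color is reported at the unique position in $[a, b]$ whose previous occurrence lies before $a$. Indices are 0-based, and the linear scan stands in for a real three-sided range structure.

```python
def prev_occurrence(A):
    """prev[i] = largest j < i with A[j] == A[i], or -1 if there is none."""
    last = {}
    prev = []
    for i, c in enumerate(A):
        prev.append(last.get(c, -1))
        last[c] = i
    return prev


def distinct_colors(A, prev, a, b):
    """Report every color of A[a..b] exactly once.

    Equivalent to the three-sided query [a, b] x (-inf, a) on the points
    (i, prev[i]); a naive scan replaces the range structure here.
    """
    return [A[i] for i in range(a, b + 1) if prev[i] < a]


A = ["r", "g", "r", "b", "g", "y"]
prev = prev_occurrence(A)
print(distinct_colors(A, prev, 1, 4))
```

Counting distinct colors in the range is the same query with the matching points counted instead of listed.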

Lemma 2 In the external memory model, colored range reporting queries can be answered in $O(\log\log_B N + K/B)$ I/Os using an $O(N \log N)$ bits data structure. In the external memory model, the colored range counting problem can be solved in $O(\log N)$ I/Os using an $O(N \log N)$ bits data structure.

Proof: We use the same reduction to three-sided reporting (counting) as in Lemma 1. There exists an $O(N \log N)$ bits data structure that supports three-sided reporting queries in $O(\log\log_B N + K/B)$ I/O operations [16]. The result of [6] can be straightforwardly extended to the external memory model. □

3 An $O(N \log N)$ Space Data Structure

In this section we consider a simpler problem: the $K$ colors with highest priorities that occur in the query interval $[a, b]$ must be reported in an arbitrary order. The data structure described in this section is based on recursive partitioning of the set of colors according to their priorities. Thus our approach in this section is similar to the idea of the wavelet tree. Every node $v$ of a binary tree $T$ is associated with a set of colors $C_v$ and an array $A_v$. If $v$ is the root node, then $A_v = A$ and $C_v = C$. When $A_v$ and $C_v$ for some node $v$ of $T$ are known, the arrays for the children of $v$ can be constructed. The set of colors $C_v$ is divided into two sets $C_0$ and $C_1$ that contain an equal number of elements^2, and all colors in $C_0$ have smaller priorities than any color in $C_1$. We denote by $N_v$ the total number of elements in $A_v$. We store an additional array $B_v$ of $N_v$ bits; the $i$-th bit of $B_v$ equals 1 if and only if the color $A_v[i]$ belongs to $C_1$. If $u$ and $w$ are the right and left children of $v$, then we set $C_w = C_0$ and $C_u = C_1$. The array $A_u$ ($A_w$) contains all elements of $A_v$ whose colors belong to $C_u$ ($C_w$): if $A_v[i]$ belongs to $C_u$ and there are $k$ indexes $j$ such that $B_v[j] = 1$ and $j \le i$, then $A_v[i]$ is stored at position $k$ in $A_u$; if $A_v[i]$ belongs to $C_w$ and there are $k$ indexes $j$ such that $B_v[j] = 0$ and $j \le i$, then $A_v[i]$ is stored at position $k$ in $A_w$. Every array $B_v$ is augmented with a rank/select data structure that enables us to count the number of 1's or 0's that occur in $B_v[1..j]$ for any $j \le N_v$. Using $B_v$ we can count the number of elements in $A_v[1..j]$ whose colors belong to $C_u$ or $C_w$.
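
The construction of one tree node can be sketched as follows. This is an illustration only: colors are represented by hypothetical integers, the set `C1` of higher-priority colors is passed in explicitly, and `rank` is a naive linear-time stand-in for the constant-time rank/select structure.

```python
def split_node(A_v, C1):
    """Split the array of node v between its children.

    Returns (B_v, A_w, A_u): the bit vector of v, the array of the left
    child w (colors outside C1) and of the right child u (colors in C1).
    """
    B_v = [1 if c in C1 else 0 for c in A_v]
    A_u = [c for c in A_v if c in C1]
    A_w = [c for c in A_v if c not in C1]
    return B_v, A_w, A_u


def rank(B, bit, j):
    """Number of positions among the first j entries of B that equal bit."""
    return sum(1 for x in B[:j] if x == bit)


A_v = [3, 1, 4, 2, 3, 1]          # a larger integer means a higher priority
B_v, A_w, A_u = split_node(A_v, C1={3, 4})
print(B_v, A_w, A_u)
```

The element at 1-based position $i$ of `A_v` with bit 1 lands at 1-based position `rank(B_v, 1, i)` of `A_u`, which is the placement rule stated above.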

Furthermore, we also store data structures $COUNT_v$ and $REP_v$ in each node $v$. The data structures $COUNT_v$ and $REP_v$ support color counting and color reporting queries on $A_v$. A tree $T$ with arrays $B_v$ is the standard wavelet tree. Thus our construction can be viewed as a wavelet tree with auxiliary data structures for color reporting and color counting stored at its nodes. The height of $T$ is $\log \sigma \le \log N$; hence, every element is stored in $\log \sigma$ secondary data structures.

^2 We assume that $\sigma = |C|$ is a power of two.

We will say that the interval $[a_v, b_v]$ corresponds to an interval $[a, b]$ in a node $v$ if all elements of $A[a..b]$ that belong to $C_v$ appear in $A_v[a_v..b_v]$ in the same order as in $A[a..b]$. If we know $a_v$ and $b_v$ for a node $v$, then $a_u$ and $b_u$ for the right child $u$ of $v$ can be found using $B_v$. We set $a_u$ to the number of 1's in $B_v[1..a_v]$; if $B_v[a_v] = 0$, then $a_u$ is incremented by 1. We set $b_u$ to the number of 1's in $B_v[1..b_v]$. Values of $a_w$ and $b_w$ for the left child $w$ can be found in a symmetric way.
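
The interval mapping just described can be written out as below. The sketch keeps the 1-based positions of the text, uses a naive prefix sum in place of the rank structure, and the function names are hypothetical. An empty result is signalled by a returned pair whose first component exceeds the second.

```python
def ones_upto(B, j):
    """Number of 1 bits in B[1..j], with 1-based positions as in the text."""
    return sum(B[:j])


def to_right_child(B_v, a_v, b_v):
    """Interval [a_u, b_u] in the right child u corresponding to [a_v, b_v]."""
    a_u = ones_upto(B_v, a_v)
    if B_v[a_v - 1] == 0:          # position a_v itself does not belong to u
        a_u += 1
    return a_u, ones_upto(B_v, b_v)


def to_left_child(B_v, a_v, b_v):
    """Symmetric mapping for the left child w, counting 0 bits instead."""
    def zeros_upto(j):
        return j - ones_upto(B_v, j)
    a_w = zeros_upto(a_v)
    if B_v[a_v - 1] == 1:          # position a_v itself does not belong to w
        a_w += 1
    return a_w, zeros_upto(b_v)


B_v = [1, 0, 1, 0, 1, 0]           # bit vector of A_v = [3, 1, 4, 2, 3, 1]
print(to_right_child(B_v, 2, 5), to_left_child(B_v, 2, 5))
```

For the interval $[2, 5]$ of `A_v`, the right child receives the elements 4 and 3 and the left child receives 1 and 2, matching positions $[2, 3]$ and $[1, 2]$ of the child arrays.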

We can report the top $K$ colors in the interval $[a, b]$ using an algorithm that visits a sequence of nodes starting at the root of $T$. Initially, we set $a_v = a$ and $b_v = b$ for the root node $v$. In every visited node we proceed as follows.

1. We use $B_v$ to find $[a_u, b_u]$ that corresponds to $[a, b]$ in the right child $u$ of $v$.

2. We visit the node $u$ and count the number $m_u$ of distinct colors in $A_u[a_u..b_u]$. Obviously, $m_u$ equals the number of distinct colors from $C_u$ that occur in $A[a..b]$.

3. If $m_u \ge K$, we report the top $K$ colors in $A_u[a_u..b_u]$ using the same procedure (i.e., we set $v = u$ and return to step 1).

4. If $m_u < K$, we report all colors in $A_u[a_u..b_u]$ using $REP_u$.

5. Then, we report the top $K - m_u$ colors in the left child $w$ of $v$. We use $B_v$ to find $a_w$ and $b_w$, where $[a_w, b_w]$ corresponds to $[a, b]$ in $w$. Then, we set $K = K - m_u$, $v = w$, and return to step 1.

The total number of visited nodes is $O(\log N)$. In every node we answer at most one color counting query and at most one color reporting query. By Lemma 1 the counting query can be answered in $O(\log N)$ time. The color reporting query in a node $v$ can be answered in $O(K_v)$ time, where $K_v$ denotes the number of colors reported in the node $v$. Thus a query can be answered in $O(\log N + K)$ time.
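
The steps above can be sketched as a short recursive procedure. This is a simplified illustration, not the data structure itself: colors are assumed to be integers in $[lo, hi)$ with larger values meaning higher priority, child arrays are rebuilt on the fly instead of being stored with bit vectors, indices are 0-based, and Python sets stand in for $COUNT_u$ and $REP_u$.

```python
def top_k_unsorted(A_v, lo, hi, a, b, K):
    """Return up to K highest-priority colors of A_v[a..b], unsorted.

    Colors are ints in [lo, hi), hi - lo is a power of two, and the
    interval [a, b] is 0-based and inclusive (empty when a > b).
    """
    if K <= 0 or a > b:
        return []
    if hi - lo == 1:                       # leaf: a single color remains
        return [lo]
    mid = (lo + hi) // 2
    A_u = [c for c in A_v if c >= mid]     # right child, higher priorities
    A_w = [c for c in A_v if c < mid]      # left child, lower priorities
    ones_before_a = sum(1 for c in A_v[:a] if c >= mid)
    ones_upto_b = sum(1 for c in A_v[:b + 1] if c >= mid)
    a_u, b_u = ones_before_a, ones_upto_b - 1
    a_w, b_w = a - ones_before_a, (b + 1 - ones_upto_b) - 1
    in_u = set(A_u[a_u:b_u + 1])           # stand-in for COUNT_u and REP_u
    m_u = len(in_u)
    if m_u >= K:                           # step 3: all answers are in u
        return top_k_unsorted(A_u, mid, hi, a_u, b_u, K)
    # steps 4 and 5: report everything in u, continue in w with K - m_u
    return list(in_u) + top_k_unsorted(A_w, lo, mid, a_w, b_w, K - m_u)


A = [3, 1, 0, 2, 3, 1, 0, 2]
print(sorted(top_k_unsorted(A, 0, 4, 1, 3, 2), reverse=True))
```

The output order within a node is arbitrary, which matches the unsorted guarantee of this section; sorting is applied here only for display.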

Lemma 3 There exists an $O(N \log N)$ bits data structure that outputs an unsorted list of top-$K$ colors in a query interval $[a, b]$ in $O(\log N + K)$ time.

4 A Linear Space Data Structure 

In this section we describe a linear space data structure that enables us to answer queries in $O(N^{1/f} + K)$ time. Our main idea is to store reporting and counting data structures only in selected nodes of the tree $T$, so that each element is stored only in a constant number of data structures.

We say that a node $v$ is on level $x$ if $v$ has $x$ ancestors. We say that a node $v$ is an important node if $v$ is situated on the $i\lfloor (1/f) \log N \rfloor$-th level for $i = 0, 1, \ldots, f$ and a constant $f$, or $v$ is a leaf node. Instead of storing the array $A_v$ and the auxiliary data structures $REP_v$ and $CNT_v$ in every node $v$ of $T$, we only store them in the important nodes. Besides that, we also store a data structure $E_v$ in every important node. Let $v_1, \ldots, v_t$ be the highest important descendants of $v$, i.e., each node $v_i$ is an important node and there are no important nodes on the path between $v_i$ and $v$. There are $t \le N^{1/f}$ highest important descendants of $v$. For any $1 \le i \le t$ and any $1 \le j \le N_v$, the data structure $E_v$ enables us to count elements with color $c \in C_{v_i}$ at positions $m \le j$ of the array $A_v$. We can implement $E_v$ as follows. For any $1 \le i \le t$, we store the positions of all elements in $A_v$ colored with a color from $C_{v_i}$ in a standard one-dimensional range counting data structure [6]. Every such data structure uses $O(N_{v_i} \log N_v)$ bits of space and answers queries in $O(\log N)$ time. Since $\sum_{i=1}^{t} N_{v_i} = N_v$, $E_v$ uses $O(N_v \log N_v)$ bits.



All nodes $v$ on the same level $l = i\lfloor (1/f) \log N \rfloor$ contain $O(N)$ elements. Hence all data structures $REP_v$, $CNT_v$ and $E_v$ for important nodes $v$ situated on the same level use $O(N \log N)$ bits of space. Since important nodes are situated on a constant number of levels, the total space usage is $O(N \log N)$ bits.
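
As a small numerical illustration of the sparsification (a sketch with hypothetical function names, assuming $N$ is a power of two and logarithms are base 2), the important levels and the bound on the number of highest important descendants can be computed as follows.

```python
import math


def important_levels(N, f):
    """Levels i * floor((1/f) * log2 N) for i = 0, 1, ..., f."""
    step = math.floor(math.log2(N) / f)
    return [i * step for i in range(f + 1)]


def max_highest_important_descendants(N, f):
    """A binary tree has at most 2^step nodes 'step' levels below a node."""
    return 2 ** math.floor(math.log2(N) / f)


print(important_levels(2 ** 12, 3))
print(max_highest_important_descendants(2 ** 12, 3))
```

For $N = 2^{12}$ and $f = 3$ only four levels carry auxiliary structures, and each important node has at most $16 = N^{1/3}$ highest important descendants.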

The query answering procedure is similar to the one described in section 3, but only important nodes of $T$ are visited. The search starts at the root; we set $a_v = a$, $b_v = b$, and $i = t$, where $t$ is the number of the highest important descendants of $v$.

1. Let $a_{v_i}$ and $b_{v_i}$ denote the interval that corresponds to $[a, b]$ in $v_i$. If we know $a_v$ and $b_v$, we can find $a_{v_i}$ and $b_{v_i}$ using $E_v$. Then, we use the data structure in the node $v_i$ to compute the number of colors $m_{v_i}$ in $A_{v_i}[a_{v_i}..b_{v_i}]$ and the sum $r_i = \sum_{j=i}^{t} m_{v_j}$.

2. If $r_i < K$, we visit $v_i$, report all $m_{v_i}$ colors that occur in $A_{v_i}[a_{v_i}..b_{v_i}]$ and proceed with the child $v_{i-1}$.

3. If $r_i \ge K$, we set $K = K - r_{i+1}$ and use the same procedure to report the top $K$ colors that occur in $A_{v_i}[a_{v_i}..b_{v_i}]$.

The total number of visited nodes is $O(f N^{1/f}) = O(N^{1/f})$. In every node we answer at most one color counting and one color reporting query; hence we obtain an unsorted list of top $K$ colors in $O(N^{1/f} \log N + K)$ time. If $K < N^{1/f}$, we can sort $K$ colors by priorities in $O(N^{1/f} \log N)$ time. If $K \ge N^{1/f}$, we can use radix sorting^3 and sort the colors in $O(K)$ time. Thus a query is answered in $O(N^{1/f} \log N + K)$ time.

Since $O(N^{1/f'} \log N) = O(N^{1/f})$ for $f' > f$, we can substitute $f' > f$ for $f$ in the above construction and obtain a linear space data structure with $O(N^{1/f} + K)$ query cost for any constant $f$.

Lemma 4 For any constant $f$, there exists an $O(N \log N)$ bit data structure that supports top-$K$ color queries in $O(N^{1/f} + K)$ time.

5 A Data Structure with $O(K)$ Query Time

In this section we will use the result of Lemma 4 for $f = 2$ as the starting point. Although the reporting time $O(\sqrt{N} + K)$ is very high, $O(N^{1/2} + K) = O(K)$ if $K = \Omega(N^{1/2})$. Hence, the data structure of Lemma 4 is optimal for $K = \Omega(\sqrt{N})$. We can take care of the case when $K < \sqrt{N}$ colors must be reported by explicitly storing the solutions for some pre-defined queries and storing recursively defined data structures for subarrays. We start by explaining the main ideas of our approach; a more detailed description will be given later in this section.

Our Approach. Lemma 4 enables us to answer top-$K$ queries in $O(K + \sqrt{N}) = O(K)$ time when $K \ge \sqrt{N}$. Using the approach described below, we can store the answers to top-$K$ queries for $K < \sqrt{N}$ and for a set of intervals with $O(\sqrt{N})$ endpoints using linear space. Let $J = \{ i \lfloor \sqrt{N} \rfloor \}$ and let $L(m, a, b)$ denote the set of top $m$ colors in $A[a..b]$ sorted in decreasing order by priorities. For every $i \in J$ and for every interval $[i - 2^r, i]$ and $[i, i + 2^r]$, $r = 1, 2, \ldots, \log N$, we explicitly store the lists $L(\sqrt{N}, i - 2^r, i)$ and $L(\sqrt{N}, i, i + 2^r)$. For any interval $[a, b]$ such that $a \in J$ and $b \in J$, we can represent $[a, b]$ as a union of two (possibly intersecting) intervals $[a, a + 2^x]$ and $[b - 2^x, b]$ for $x = \lfloor \log(b - a) \rfloor$. We can find the top $K < \sqrt{N}$ colors in $A[a..b]$ by examining the first $K$ colors in $L(\sqrt{N}, a, a + 2^x)$ and $L(\sqrt{N}, b - 2^x, b)$ and reporting the $K$ colors with highest priorities. Hence, special queries on intervals $A[a..b]$, where $a$ and $b$ are from $J$, can also be answered in $O(K)$ time.
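
The covering by two power-of-two intervals and the merging of their stored lists can be sketched as follows. The sketch assumes that each color is represented by its (distinct) integer priority, so that equal values denote the same color; the function names are hypothetical.

```python
def cover(a, b):
    """Two possibly overlapping intervals of length 2^x whose union is [a, b].

    x = floor(log2(b - a)); requires b > a.
    """
    x = (b - a).bit_length() - 1
    return (a, a + 2 ** x), (b - 2 ** x, b)


def merge_top_k(list1, list2, K):
    """Top K distinct colors of two lists sorted by decreasing priority.

    A color present in both lists shows up twice in a row in the merged
    order, so comparing with the last output removes the duplicate and
    the loop runs for O(K) iterations.
    """
    out = []
    i = j = 0
    while len(out) < K and (i < len(list1) or j < len(list2)):
        if j >= len(list2) or (i < len(list1) and list1[i] >= list2[j]):
            c = list1[i]
            i += 1
        else:
            c = list2[j]
            j += 1
        if not out or out[-1] != c:
            out.append(c)
    return out


print(cover(3, 10))
print(merge_top_k([9, 7, 4, 1], [8, 7, 5, 4], 3))
```

Because the two intervals together cover $[a, b]$, every one of the top $K$ colors of $A[a..b]$ is among the first $K$ entries of at least one of the two lists.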



^3 We assume that priorities of colors belong to the range $[1, O(N)]$. If this is not the case, then priorities can be replaced by their ranks.



We store the data structure of Lemma 4 for each subarray $A[i_1..i_2]$ such that $i_2$ follows $i_1$ in $J$. Since each data structure for $A[i_1..i_2]$ contains roughly $\sqrt{N}$ elements, it supports queries in $O(N^{1/4} + K)$ time. This query time is optimal for $K \ge N^{1/4}$. We thus obtain a data structure with optimal query time for $K \ge N^{1/4}$: each interval $[a, b]$ can be represented as a union of three intervals $[a, a_1]$, $[a_1, b_1]$ and $[b_1, b]$ such that $a_1 \in J$ and $b_1 \in J$. We can find sorted lists of top-$K$ colors in all three intervals as described above; then, we can traverse the lists and identify the top colors in $O(K)$ time.

We can apply the same construction once again and obtain optimal query time for $K \ge N^{1/8}$. If we apply the same idea $O(\log \log N)$ times, then we can support queries in optimal $O(K)$ time for an arbitrary $K$. The precise description is given below.

Data Structure. Let $\rho(l) = (1/2)^l$ and $\Delta = \log^2 N$. We define the sets $J_1, J_2, \ldots, J_h$, where $h = O(\log \log N)$, as $J_l = \{ i \cdot \lfloor N^{\rho(l)} \rfloor \cdot \Delta \mid 0 \le i \le N^{1-\rho(l)}/\Delta \}$. The last index $h$ is chosen so that $N^{\rho(h)} = const$. For every $j \in J_l$, $1 \le l \le h$, we store $L(\lceil N^{\rho(l)} \rceil, j - 2^r, j)$ and $L(\lceil N^{\rho(l)} \rceil, j, j + 2^r)$ for $r = 1, 2, \ldots, \log N$.

For every subarray $A[j_1..j_2]$ such that $j_2$ follows $j_1$ in $J_l$ and $1 \le l < h$, we store the data structure $R(l, j_1, j_2)$ implemented as in section 4. For every subarray $A[j_1..j_2]$ such that $j_2$ follows $j_1$ in $J_h$, we store the data structure $R(h, j_1, j_2)$. $R(h, j_1, j_2)$ is also implemented according to Lemma 4, but we set the constant $f = 6$, so that $R(h, j_1, j_2)$ answers top-$K$ queries in $O((j_2 - j_1)^{1/6} + K)$ time. For every subarray $A[j_1..j_2]$ such that $j_2$ follows $j_1$ in $J_h$, we also store a data structure $F(j_1, j_2)$ that supports top-$K$ color queries in $O(K)$ time in the case when $K < \log^{1/3} N$. This data structure will be described later in this section. Data structures $R(\cdot, \cdot, \cdot)$ and $F(\cdot, \cdot)$ use modified sets of colors. Let $C(j_1, j_2)$ be the set of all colors that occur in $A[j_1..j_2]$. Let $M(j_1, j_2)$ denote the set in which every color in $C(j_1, j_2)$ is replaced by the rank of its priority in $C(j_1, j_2)$: $M(j_1, j_2) = \{ prank(c, C(j_1, j_2)) \mid c \in C(j_1, j_2) \}$, where $prank(c, C) = |\{ c' \in C \mid p(c') \le p(c) \}|$. We store the data structure $R(l, j_1, j_2)$ or $F(j_1, j_2)$ for the set of colors $M(j_1, j_2)$ and assume that the priority of a color $p \in M(j_1, j_2)$ equals $p$. We also store a table $Tbl(j_1, j_2)$ that enables us to find the color $c \in C$ that corresponds to a color $p \in M(j_1, j_2)$. In this way we guarantee that all colors and their priorities in $R(l, j_1, j_2)$ or $F(j_1, j_2)$ belong to the range $[1, j_2 - j_1 + 1]$.
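
The color renaming by priority rank and the table that maps ranks back to original colors can be sketched as follows. This is an illustration with hypothetical names and 0-based inclusive indices; the table stores, for each rank, a relative position in the subarray at which a color of that rank occurs, one natural choice being the first occurrence.

```python
def build_rank_colors(A, priority, j1, j2):
    """Rename the colors of A[j1..j2] by the rank of their priority.

    Returns (M, tbl): M is the subarray with each color replaced by
    prank(c, C(j1, j2)), and tbl[p] is a relative position t such that
    A[j1 + t] has the color whose rank is p.
    """
    sub = A[j1:j2 + 1]
    by_priority = sorted(set(sub), key=lambda c: priority[c])
    prank = {c: r for r, c in enumerate(by_priority, start=1)}
    M = [prank[c] for c in sub]
    tbl = {prank[c]: sub.index(c) for c in by_priority}
    return M, tbl


A = ["r", "g", "r", "b", "g", "y"]
priority = {"r": 3, "g": 1, "b": 4, "y": 2}
M, tbl = build_rank_colors(A, priority, 1, 4)
print(M, tbl)
```

After renaming, every value in `M` lies between 1 and the length of the subarray, so a local structure needs only the logarithm of the subarray length in bits per color, while `tbl` recovers the original color through a position in `A`.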

Space Usage. We turn to the space analysis of our method. All lists $L(\cdot, \cdot, \cdot)$ use $o(N)$ bits: for each $l$, there are $O\left(\frac{N^{1-\rho(l)}}{\Delta} \log N\right)$ lists and each list uses $O(N^{\rho(l)} \log N)$ bits. Hence, the total space usage of all lists for a fixed $l$ is $O(N / \log N)$ bits.

Every table $Tbl(j_1, j_2)$ for a data structure $R(l, j_1, j_2)$ can be stored in $O(N^{\rho(l)} \log(N^{\rho(l)}))$ bits. For every color $p$, $1 \le p \le |C(j_1, j_2)|$, the $p$-th entry contains a pointer to the color $c_p$ such that $prank(c_p, C(j_1, j_2)) = p$. The pointer to $c_p$ is the relative position of an element of color $c_p$ in $A[j_1..j_2]$. That is, $Tbl(j_1, j_2)[p] = t$ for some $t$ such that the color of $A[j_1 + t]$ is $c_p$. Since $c_p \in C(j_1, j_2)$, such $t$ always exists. Since $0 \le t \le N^{\rho(l)}$, we need $O(N^{\rho(l)} \log(N^{\rho(l)}))$ bits to store the table.

We can also store all data structures $R(\cdot, \cdot, \cdot)$ in $O(N \log N)$ bits: every data structure $R(l, j_1, j_2)$ contains $O(N^{\rho(l)})$ elements and colors of elements belong to the interval $[1, N^{\rho(l)}]$. Hence, by Lemma 4, we can store each $R(l, j_1, j_2)$ in $O(N^{\rho(l)} \log(N^{\rho(l)}))$ bits. Thus for a fixed value of $l$, all data structures $R(l, j_1, j_2)$ with tables $Tbl(j_1, j_2)$ use $O(N \rho(l) \log N) = \rho(l)\, O(N \log N)$ bits of space (the constant hidden in the big-Oh notation does not depend on $l$). Since $\sum_{l=0}^{h} \rho(l) = O(1)$, all data structures $R(\cdot, \cdot, \cdot)$ use $O(N \log N)$ bits. We will show below that all $F(\cdot, \cdot)$ also use $O(N \log N)$ bits. Thus the total space usage of our construction is $O(N \log N)$ bits.

Answering Queries. The procedure for reporting the top $K$ colors in the range $[a, b]$ consists of the following steps. In step 1, we identify the actual number of reported colors $K_q$: if $A[a..b]$ contains $K'$ distinct colors, then $K_q = \min(K', K)$. In step 2, we represent $[a, b]$ as a union of at most three intervals. The top $K_q$ colors in the middle interval $A[a_1..b_1]$ can be found using lists $L(\cdot, \cdot, \cdot)$, as explained in step 3. During steps 4-6 we find the top $K_q$ colors in the two other intervals, $A[a..a_1]$ and $A[b_1..b]$.

1. We check whether the number of distinct colors in $[a, b]$ exceeds $K$. Using the data structure of Lemma 1 we report colors in the interval $[a, b]$ until $K$ colors are reported or the procedure stops. If the procedure stops when $K' < K$ colors are reported, we set $K_q = K'$. Otherwise, we set $K_q = K$.

2. We identify the largest value $l$ such that $N^{\rho(l)} \ge K_q$. We can find $l$ by searching among $h = O(\log \log N)$ different values in $O(1)$ time using the result of [7]. Let $a_1 = \lceil a / \lceil N^{\rho(l)} \rceil \rceil$ and $b_1 = \lfloor b / \lceil N^{\rho(l)} \rceil \rfloor$.

3. We identify the top $K_q$ colors in $[a_1, b_1]$ by computing $x = \lfloor \log(b_1 - a_1) \rfloor$ and examining the top $K_q$ colors in $L(\lceil N^{\rho(l)} \rceil, a_1, a_1 + 2^x)$ and $L(\lceil N^{\rho(l)} \rceil, b_1 - 2^x, b_1)$.

4. If $l < h$, we use the data structure $R(l, (a_1 - 1) \lceil N^{\rho(l)} \rceil, a_1 \lceil N^{\rho(l)} \rceil)$ to identify the top $K_q$ colors in $A[a..a_1]$ in $O((N^{\rho(l)})^{1/2} + K_q) = O(N^{\rho(l+1)} + K_q)$ time. Since $K_q > N^{\rho(l+1)}$, the top colors in $A[a..a_1]$ are found in $O(K_q)$ time. The top $K_q$ colors in $A[b_1..b]$ are found in the same way.

5. If $l = h$ and $K_q \ge \log^{1/3} N$, we use the data structure $R(h, (a_1 - 1) \lceil N^{\rho(h)} \rceil, a_1 \lceil N^{\rho(h)} \rceil)$. Since $N^{\rho(h)} \Delta = O(\log^2 N)$, this takes $O((\log^2 N)^{1/6} + K_q) = O(K_q)$ time by Lemma 4. The top $K_q$ colors in $A[b_1..b]$ are found in the same way.

6. If $l = h$ and $K_q < \log^{1/3} N$, we use data structures $F((a_1 - 1) \lceil N^{\rho(h)} \rceil, a_1 \lceil N^{\rho(h)} \rceil)$ and $F(b_1 \lceil N^{\rho(h)} \rceil, (b_1 + 1) \lceil N^{\rho(h)} \rceil)$ to report the top $K_q$ colors in $A[a..a_1]$ and $A[b_1..b]$.

7. When we know the top colors in $A[a..a_1]$, $A[a_1..b_1]$, and $A[b_1..b]$, the top $K_q$ colors in $A[a..b]$ can be found in $O(K_q)$ time.
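
The choice of the level in step 2 can be illustrated with a simple linear scan over the $h$ candidate levels. This is only a sketch: the paper obtains the level in $O(1)$ time with the result of [7], the function name is hypothetical, and the value 0 is returned here, as an assumed convention, when $K_q$ exceeds $N^{\rho(1)}$ and no level qualifies.

```python
def choose_level(N, K_q, h):
    """Largest l in 1..h with N^(rho(l)) >= K_q, where rho(l) = (1/2)^l.

    Returns 0 if no such level exists. Uses floating-point powers, which
    is adequate for illustration away from exact boundary values.
    """
    best = 0
    for l in range(1, h + 1):
        if N ** (0.5 ** l) >= K_q:
            best = l
    return best


# For N = 2^16 the thresholds N^(rho(l)) are 256, 16, 4, 2 for l = 1..4.
print(choose_level(2 ** 16, 10, 4))
```

With the chosen $l$, the boundary subarrays hold about $N^{\rho(l)}$ elements each, so the $O(N^{\rho(l+1)} + K_q)$ cost of step 4 is dominated by $K_q$.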

Data Structure F. We describe the data structure $F(j_1, j_2)$, where $j_1$ and $j_2$ are two consecutive indexes in $J_h$. Since every color in $M(j_1, j_2)$ belongs to $[1, j_2 - j_1 + 1] = [1, O(\log^2 N)]$, every color in $M(j_1, j_2)$ can be specified with $O(\log \log N)$ bits. Let $V(j_1, j_2) = \{ v_i \}$ for $v_i = j_1 + i \lfloor \sqrt{\log N} \rfloor$ and $v_i \le j_2$. For every $v_i$ and for any $r$ such that $v_i + 2^r \le j_2$, we store the list $L(\lceil \log^{1/3} N \rceil, v_i, v_i + 2^r)$; for every $v_i$ and for any $r$ such that $v_i - 2^r \ge j_1$, we store the list $L(\lceil \log^{1/3} N \rceil, v_i - 2^r, v_i)$. Since there are $O(\log \log N)$ different values of $r$ for each $v_i$, all lists $L(\cdot, \cdot, \cdot)$ use $o((j_2 - j_1) \log N)$ bits of space. For any two consecutive indexes $v_i$ and $v_{i+1}$ in $V(j_1, j_2)$, we store the colors of all elements in $A[v_i..v_{i+1}]$ in one machine word $W(v_i, v_{i+1})$. Using one look-up table of size $o(N)$ for all words stored in the data structure and standard bit operations, we can answer top-$K$ queries on $A[v_i..v_{i+1}]$ in $O(K)$ time.

Any interval $[a, b]$ for $j_1 \le a \le b \le j_2$ can be represented as a union of intervals $[a, a_f]$, $[a_f, b_e]$, and $[b_e, b]$, where $a_f = \lceil (a - j_1)/g \rceil$, $b_e = \lfloor (b - j_1)/g \rfloor$ and $g = \lfloor \sqrt{\log N} \rfloor$. We can use lists $L(\lceil \log^{1/3} N \rceil, a_f, a_f + 2^x)$ and $L(\lceil \log^{1/3} N \rceil, b_e - 2^x, b_e)$ for $x = \lfloor \log(b_e - a_f) \rfloor$ to find the top $K$ colors in $A[a_f..b_e]$. We can find the top $K$ colors in $A[a..a_f]$ and $A[b_e..b]$ in $O(K)$ time. Finally, we can merge the three lists and obtain the list of top $K$ colors in $O(K)$ time.

Thus we obtain 

Theorem 1 There exists an $O(N \log N)$ bits data structure that supports top-$K$ color queries in
$O(K)$ time.

The data structure described above can be constructed in $O(N \log N)$ time using the following
algorithm. Since all data structures $R(\cdot, \cdot, \cdot)$ contain $O(N \log\log N)$ elements, all $R(\cdot, \cdot, \cdot)$ can be
constructed in $O(N \log\log N)$ time. Data structures $F(\cdot, \cdot)$ can also be constructed in $O(N)$ time.

We construct a data structure of Lemma 3 in $O(N \log N)$ time and use it to generate lists
$L(\cdot, \cdot, \cdot)$. The total number of lists $L(\cdot, \cdot, \cdot)$ is $O(N \log N/\Delta)$ and the total number of elements in all
$L(\cdot, \cdot, \cdot)$ is $O((N/\Delta) \log N \log\log N)$. Every list $L(N^{\rho(l)}, j_1, j_2)$ can be generated in $O(\log^2 N + N^{\rho(l)})$
time. Hence, all lists are constructed in $O(N(\log^3 N/\Delta) + (N/\Delta) \log N \log\log N) = O(N(\log^3 N/\Delta))$
time. Thus the data structure of Theorem 1 can be constructed in $O(N \log N)$ time.

The result of Theorem 1 can also be extended to the external memory model. We will show it
in section 7.

6 An $O(N \log \sigma)$ Bit Data Structure

We can further improve the result of Theorem 1 in the case when $\sigma = o(N)$ and construct a data
structure that uses $O(N \log \sigma)$ bits.¹ In this section we assume w.l.o.g. that every color is an
integer between 1 and $\sigma$ (if this is not the case, we replace the color by the rank of its priority).

Our main idea is to divide the array into chunks, so that the data structure for each chunk uses
$O(\log \sigma)$ bits per element. We also need the "global" data structure for searching in many chunks;
this data structure contains $O(N/\log N)$ elements and therefore can be stored in $O(N)$ bits. In the
following description we will distinguish between two cases, $\sigma^2 \ge \log N$ and $\sigma^2 < \log N$. Although
the data structures for both situations are based on the same idea, for ease of description we discuss
the two cases separately.

Case 1: $\sigma^2 \ge \log N$. The array $A$ is divided into chunks $L_i$ so that each chunk contains $\ell = \sigma^3$
elements, $L_i = A[(i - 1)\ell + 1 .. i\ell]$. We store the data structure of Theorem 1 for every chunk. Every
such data structure uses $O(\ell \log \sigma)$ bits; all chunk data structures use $O(N \log \sigma)$ bits of space.

Besides that, we store a top level data structure $D^t$ for an auxiliary array $A^t$. The array $A^t$
contains $\sigma$ entries for each chunk; the entries $A^t[(i - 1)\sigma + 1 .. i\sigma]$ contain information about colors
that occur in the chunk $L_i$. If a color $c$ occurs in $L_i$, then $A^t[(i - 1)\sigma + c]$ is colored with $c$. If $c$ does
not occur in $L_i$, then $A^t[(i - 1)\sigma + c]$ is colored with a dummy color $c_\emptyset$; we assume that the priority
of $c_\emptyset$ is smaller than the priority of any other color. We store the top-$K$ data structure of Theorem 1
for $A^t$. Since the total number of elements in $A^t$ is $O(N\sigma/\ell) = O(N/\sigma^2) = O(N/\log N)$, both
$A^t$ and $D^t$ use $O(N)$ bits.

If $a$ and $b$ belong to the same chunk, then we can answer the query $Q = [a, b]$ using the data
structure for this chunk. Otherwise, a query range $Q = [a, b]$ can be decomposed into $[a, a']$,
$[a' + 1, b']$, and $[b' + 1, b]$ for $a' = \lceil a/\ell \rceil \ell$ and $b' = \lfloor b/\ell \rfloor \ell$. Top $K$ colors in $A[a..a']$ and $A[b' + 1..b]$
can be found using chunk data structures. Top $K$ colors in $A[a' + 1 .. b']$ can be found using $D^t$:
a color $c$ occurs in $A[a' + 1 .. b']$ if and only if $c$ occurs in $A^t[\lceil a/\ell \rceil \sigma + 1 .. \lfloor b/\ell \rfloor \sigma]$. Hence, we can
find the top $K$ colors in $A[a' + 1 .. b']$ by identifying the top $K$ colors in $A^t[\lceil a/\ell \rceil \sigma + 1 .. \lfloor b/\ell \rfloor \sigma]$
and discarding the dummy color $c_\emptyset$ if necessary. Since all three lists of top $K$ colors are sorted by
priorities, we can merge them and obtain top $K$ colors in $O(K)$ time.
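
The decomposition of Case 1 can be sketched as follows. This is only an illustration under stated assumptions: a linear scan (`top_k_scan`) stands in for the data structure of Theorem 1, colors are integers $1..\sigma$ with pairwise distinct priorities, the dummy color is encoded as 0, and all function names are hypothetical.

```python
import heapq

DUMMY = 0  # assumed encoding of the dummy color; real colors are 1..sigma


def top_k_scan(colors, priority, k):
    """Brute-force stand-in for the Theorem 1 structure: top-k distinct colors."""
    distinct = {c for c in colors if c != DUMMY}
    return sorted(distinct, key=priority.__getitem__, reverse=True)[:k]


def build_aux(A, sigma, ell):
    """Auxiliary array: sigma slots per chunk; slot c holds c iff c occurs in the chunk."""
    n_chunks = -(-len(A) // ell)
    aux = [DUMMY] * (n_chunks * sigma)
    for pos, c in enumerate(A):                 # pos is 0-based
        aux[(pos // ell) * sigma + (c - 1)] = c
    return aux


def query(A, aux, sigma, ell, priority, a, b, k):
    """Top-k colors in A[a..b]; positions are 1-based and inclusive, as in the text."""
    if (a - 1) // ell == (b - 1) // ell:        # a and b lie in the same chunk
        return top_k_scan(A[a - 1:b], priority, k)
    ca = -(-a // ell)                           # ceil(a / ell), so a' = ca * ell
    cb = b // ell                               # floor(b / ell), so b' = cb * ell
    parts = [
        top_k_scan(A[a - 1:ca * ell], priority, k),           # A[a..a']
        top_k_scan(aux[ca * sigma:cb * sigma], priority, k),  # whole chunks, via aux
        top_k_scan(A[cb * ell:b], priority, k),               # A[b'+1..b]
    ]
    result = []
    # All three lists are sorted by decreasing priority; equal colors become adjacent.
    for c in heapq.merge(*parts, key=priority.__getitem__, reverse=True):
        if len(result) == k:
            break
        if not result or result[-1] != c:
            result.append(c)
    return result
```

The sketch reproduces only the way a query is split into two chunk queries and one query on the auxiliary array; it does not achieve the space or time bounds of the actual construction.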

Case 2: $\sigma^2 < \log N$. We use the same construction as for the case $\sigma^2 \ge \log N$: the array $A$ is
divided into chunks and there is a data structure for each chunk. There is also a data structure
$D^t$ for the array $A^t$ defined as above. But now each chunk $L_i$ consists of $\sigma^2 \lfloor \log N \rfloor$ elements. It
remains to describe the new chunk data structure.



¹ In this section we assume that all colors are positive integers bounded by $\sigma^{O(1)}$. In the general case, our
construction needs $O(\sigma \log m)$ additional bits, where $m$ is the maximal color.
Every chunk $L_i$ is divided into $O(\sigma^2 \lfloor \log \sigma \rfloor)$ pieces $\mathcal{P}_j$ and each piece consists of $\lceil \log_\sigma N \rceil$
elements. The array $L_i^t$ contains $\sigma$ entries for each piece. If the $j$-th piece
$L_i[j\lceil \log_\sigma N \rceil + 1 .. (j + 1)\lceil \log_\sigma N \rceil]$ contains an element of color $c$, then the color of $L_i^t[j\sigma + c]$ is $c$; otherwise the color of
$L_i^t[j\sigma + c]$ is the dummy color $c_\emptyset$. The top-$K$ color data structure for $L_i^t$ needs $O(\sigma^3 \log \sigma)$ bits
of space.

Every piece $\mathcal{P}_j$ contains $O(\log_\sigma N)$ elements. Since the color of each element can be stored
in $O(\log \sigma)$ bits, each piece fits into $O(1)$ words of $\log N$ bits. We can answer top-$K$ queries on
each piece using a pre-computed table $T$ of size $o(N)$. The table $T$ contains information about
all possible sequences $seq_i$ of $\lfloor \log_\sigma N/2 \rfloor$ colors. For every sequence $seq_i$ of $\lfloor \log_\sigma N/2 \rfloor$ colors and
for any $1 \le x_1 \le x_2 \le \lfloor \log_\sigma N/2 \rfloor$, we store all colors that occur between $x_1$ and $x_2$ in $seq_i$ sorted
in decreasing order by their priorities. Since there are $O(\sqrt{N})$ sequences $seq_i$ and every sequence has
a polylogarithmic number of entries, such a table uses $O(\sqrt{N}\,\mathrm{polylog}(N)) = o(N)$ bits of space. We can
decompose a piece into a constant number of
sequences $seq_i$ of $\lfloor \log_\sigma N/2 \rfloor$ colors; for every sequence $seq_i$, we can look up the corresponding
entry in the table $T$ and identify (at most) top-$K$ colors between any two positions of $seq_i$ in $O(1)$
time per reported color. Hence, we can answer a top-$K$ query on a piece $\mathcal{P}_j$ in $O(K)$ time.
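
A toy version of the table $T$ may help to make the idea concrete. The sketch below assumes colors $1..\sigma$ and a short sequence length `m` (playing the role of $\lfloor \log_\sigma N/2 \rfloor$); it keys the table by the tuple of colors, whereas the construction in the text would use the packed machine word. The name `build_table` is hypothetical.

```python
from itertools import product


def build_table(sigma, m, priority):
    """For every length-m color sequence and every 1 <= x1 <= x2 <= m,
    store the distinct colors in positions x1..x2 by decreasing priority."""
    table = {}
    for seq in product(range(1, sigma + 1), repeat=m):
        answers = {}
        for x1 in range(1, m + 1):
            for x2 in range(x1, m + 1):
                distinct = set(seq[x1 - 1:x2])
                answers[(x1, x2)] = sorted(
                    distinct, key=priority.__getitem__, reverse=True
                )
        table[seq] = answers
    return table


if __name__ == "__main__":
    priority = {1: 1, 2: 3, 3: 2}          # assumed example priorities
    T = build_table(3, 3, priority)
    # Top-2 colors between positions 1 and 3 of the sequence (2, 1, 3):
    print(T[(2, 1, 3)][(1, 3)][:2])
```

A top-$K$ query on a sequence is then a single look-up followed by reading the first $K$ entries of the stored list.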

We described in the first part of section 6 how a top-$K$ query to an array $A$ can be decomposed
into two queries on chunks $L_{i_1}$ and $L_{i_2}$ and one query to a data structure for $A^t$. In the same way,
the top-$K$ color query on a chunk $L_i$ can be answered by answering two queries on some pieces
$\mathcal{P}_{j_1}$ and $\mathcal{P}_{j_2}$ and one query to a data structure for $L_i^t$. Hence, a top-$K$ query on a chunk can be
answered in $O(K)$ time.

Thus we obtain the following result.

Theorem 2 Let $\sigma$ be the number of different colors. There exists an $O(N \log \sigma)$ bits data structure
that supports top-$K$ color queries in $O(K)$ time.

The construction time of this data structure is $O(N \log \sigma)$; this can be shown in the same way as
for the data structure of Theorem 1.

7 An External Memory Data Structure 

In the external memory model p], the data is stored on a disk in blocks and all computations 
are performed on data stored in the main memory of size $M$. A block consists of $B$ words of
$\log N$ bits. Using one I/O operation, we can read a block of data from disk or write it to disk.
The time complexity of external memory algorithms is measured in I/O operations and the space
usage is measured in blocks. Our data structure for top-$K$ queries can also be extended to an
external memory data structure that uses $O((N/B) \log\log N)$ blocks of space and answers queries
in $O((K/B) \log_B K)$ I/O operations (if the block size $B$ is not very small).

In this section we will distinguish between two variants of the top-$K$ color problem. In the
sorted top-$K$ color problem, $K$ colors with highest priorities must be reported in the descending
order of their priorities. In the unsorted top-$K$ color problem, $K$ colors with highest priorities are
reported in an arbitrary order. Sometimes we will specify the space usage of our data structures
in bits. We observe, however, that if we describe an external memory data structure that uses
$O(s(N))$ bits, then this data structure can be packed into $O(s(N)/(B \log N))$ blocks of space.

The data structure of Lemma [3] can be straightforwardly extended to the external memory 
model. The only difference is that data structures REP^ and CNT^ are implemented as explained 
in Lemma 2.
Lemma 5 There exists an $O(N \log N)$ bits external memory data structure that reports top-$K$
colors in $O(\log N + K/B)$ time.

The space usage can be further reduced to $O(N \log N)$ bits. In the external memory model, we can
obtain an unsorted list of top $K$ colors in the same way as explained in Lemma 4 (counting and
reporting data structures are replaced with their external memory counterparts). However, there
is no external memory equivalent of the radix sort. Hence, we need $O((K/B) \log_B K)$ I/Os to sort
the $K$ colors.

Lemma 6 For any constant $f$, there exists an $O(N \log N)$ bits data structure that supports unsorted
top-$K$ color queries in $O(N^{1/f} + K/B)$ I/Os.

For any constant $f$, there exists an $O(N \log N)$ bits data structure that supports top-$K$ color queries
in $O(N^{1/f} + (K/B) \log_B K)$ I/Os.

We can extend the result of Theorem 1 to the external memory model. The only major difference
is that all data structures $R(\cdot, \cdot, \cdot)$ and $F(\cdot, \cdot)$ use the same set of colors $C$. The data structures
$R(\cdot, \cdot, \cdot)$ are implemented using Lemma 6. We can implement each data structure $F(j_1, j_2)$
using Lemma 5: since $F(j_1, j_2)$ contains $m = O(\log N)$ elements, it can be implemented with
$O(m \log N(\log\log N))$ bits, so that queries are answered in $O((\log\log N)^2 + K/B)$ time. On the
other hand, if $B > \log N$, then we can pack all elements of $F(\cdot, \cdot)$ into one block of space and thus
answer the top-$K$ color queries on $F(\cdot, \cdot)$ in $O(K/B)$ I/Os.

Lemma 7 There exists an external memory data structure that uses $O((N/B) \log\log N)$ blocks of
space and supports unsorted top-$K$ color queries in $O((\log\log N)^2 + K/B)$ I/O operations.
There exists an external memory data structure that uses $O((N/B) \log\log N)$ blocks of space and
supports top-$K$ color queries in $O((\log\log N)^2 + (K/B) \log_B K)$ I/O operations.

We can further improve the query cost by bootstrapping: we use Lemma 7 to implement each data
structure $F(j_1, j_2)$. Since $F(j_1, j_2)$ contains $O(\log^2 N)$ elements, unsorted queries are supported
in $O((\log^{(3)} N)^2 + K/B)$ I/O operations. Here $\log^{(t)} N$ is defined as $\log^{(t)} N = \log(\log^{(t-1)} N)$ for
$t > 1$ and $\log^{(1)} N = \log N$. The improved data structure also supports queries in $O(K/B)$ I/Os if
$B > (\log\log N)^2$. If we apply the same idea $t$ times, we obtain the following result.

Theorem 3 Let $t$ be an arbitrary positive integer. There exists an external memory data structure
that uses $O((N/B) \log\log N)$ blocks of space and supports unsorted top-$K$ color queries in
$O((\log^{(t)} N)^2 + K/B)$ I/O operations. If $B > (\log^{(t-1)} N)^2$, then queries are supported in $O(K/B)$
I/Os.

There exists an external memory data structure that uses $O((N/B) \log\log N)$ blocks of space and
supports top-$K$ color queries in $O((\log^{(t)} N)^2 + (K/B) \log_B K)$ I/O operations. If $B > (\log^{(t-1)} N)^2$,
then queries are supported in $O((K/B) \log_B K)$ I/Os.
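
To get a feeling for how quickly the additive term in Theorem 3 shrinks, the iterated logarithm can be evaluated numerically. The sketch below assumes base-2 logarithms, which the text does not specify; `iterated_log` is a hypothetical helper name.

```python
import math


def iterated_log(n, t):
    """Return log^(t) n: the base-2 logarithm applied t times (t >= 1)."""
    value = float(n)
    for _ in range(t):
        value = math.log2(value)
    return value


if __name__ == "__main__":
    n = 2 ** 16
    for t in (1, 2, 3):
        # The additive term of the query bound is (log^(t) N)^2.
        print(t, iterated_log(n, t) ** 2)
```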

8 Online Queries 

We assumed in the above description that the data structure must report K top colors in the query 
interval, and the value of K is specified with the query. The same results are also valid in the 
online reporting scenario: the data structure reports top colors from a specified interval until the 
user terminates the reporting procedure or all colors in the interval are reported. 
It suffices to apply the following trick, described in e.g. [5]. Let $K_i = 2^i$. The reporting
procedure consists of stages indexed by $i = -1, 0, 1, 2, \ldots$. During the $i$-th stage we: (i) use the
data structure of Theorem 1 or Theorem 3 to generate the sorted list $\mathcal{L}_{i+1}$ of the top $2K_{i+1} - 1$
colors, (ii) remove the first $K_{i+1} - 1$ elements from $\mathcal{L}_{i+1}$, and (iii) if $i \ge 0$, output the colors from
$\mathcal{L}_i$.

By Theorem 1 we can find the top $2K_{i+1} - 1$ colors in $O(2K_{i+1}) = O(K_i)$ time. Hence, steps
(i) and (ii) above take $O(K_i)$ time. The list $\mathcal{L}_0$ is produced during the initial $(-1)$-th stage in $O(1)$
time. For $i \ge 0$, we interleave steps (i) and (ii) with the step (iii): every time we output one
element of $\mathcal{L}_i$, we spend $O(1)$ time on steps (i) and (ii). Hence, the list $\mathcal{L}_{i+1}$ is known when the
stage $i$ is completed.
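
The doubling scheme can be sketched as a generator. This is a simplified, amortized version: it computes each list $\mathcal{L}_i$ in one piece instead of interleaving the work with the output, and it uses a brute-force `top_k` as a stand-in for the data structure of Theorem 1. Both function names and the 0-based inclusive indexing are assumptions of the sketch.

```python
from itertools import islice


def top_k(A, priority, a, b, k):
    """Brute-force stand-in for a top-k color query on A[a..b] (0-based, inclusive)."""
    distinct = set(A[a:b + 1])
    return sorted(distinct, key=priority.__getitem__, reverse=True)[:k]


def report_online(A, priority, a, b):
    """Yield the colors of A[a..b] by decreasing priority until the caller stops."""
    i = 0
    while True:
        k_i = 2 ** i
        # List L_i: ranks k_i .. 2*k_i - 1 among the top 2*k_i - 1 colors.
        batch = top_k(A, priority, a, b, 2 * k_i - 1)[k_i - 1:]
        yield from batch
        if len(batch) < k_i:      # fewer colors than requested: nothing is left
            return
        i += 1


if __name__ == "__main__":
    A = [1, 2, 3, 4, 5, 6, 7]
    priority = {c: c for c in A}
    # The user terminates the query after three colors.
    print(list(islice(report_online(A, priority, 0, 6), 3)))
```

Since stage $i$ costs $O(K_i)$ time and outputs $K_i$ colors, the total cost remains proportional to the number of colors reported so far.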

9 Document Retrieval 

Preliminaries. The generalized suffix tree for documents $d_1, \ldots, d_s$ is the compact trie that
contains all suffixes of the string $d_1 \$_1 \ldots d_{s-1} \$_{s-1} d_s$, where $\$_1 < \$_2 < \cdots < \$_s$ are additional
dummy symbols. The path of a node $v$ is the string obtained by concatenating all edge labels on
the path from the root to $v$. The locus of a pattern $P$ is the highest node $v$, such that $P$ is a
prefix of the path of $v$. All occurrences of $P$ correspond to the leaf descendants of the locus of $P$.
The locus of $P$ can be found in $O(|P|)$ time. We refer to e.g., [11, 20] and references therein for a
more detailed description.

Ranked Document Listing. We store all leaves of the generalized suffix tree in an array $A$.
We set the color of the $i$-th element to $c_j$ if and only if the suffix corresponding to the $i$-th leaf $l_i$
belongs to the document $d_j$; the priority of the color $c_j$ equals the priority of the document $d_j$,
$p(c_j) = p(d_j)$. In every internal node of the suffix tree, we store the maximal and minimal index
of its leaf descendants.

Now the ranked document listing query can be answered as follows. We identify the locus $v$ of a
query pattern $P$ in $O(|P|)$ time. Let $min_v$ and $max_v$ be the minimal and maximal indexes of the leaf
descendants of $v$. Reporting top-$K$ colors that occur in $A[min_v..max_v]$ is equivalent to reporting
$K$ most highly ranked documents that contain $P$ in the reverse order of their ranks. Hence, we can
answer a ranked document listing query in $O(|P| + K)$ time. Additional space needed by our data
structure is $O(N \log s)$ bits, where $N$ is the total length of all documents.

Corollary 1 There exists an $O(N \log N)$ bits data structure that supports ranked document listing
queries in $O(|P| + K)$ time.
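
The reduction from ranked document listing to top-$K$ color queries can be illustrated with a naive sketch. It replaces the generalized suffix tree by a sorted list of all document suffixes, finds the range of suffixes that start with $P$ by binary search, and answers the color query by brute force. The function names are hypothetical, and the sentinel character U+10FFFF is assumed not to occur in any document.

```python
from bisect import bisect_left


def build_index(docs):
    """Sort all suffixes of all documents; A[i] is the document (color) of the i-th suffix."""
    entries = sorted(
        (doc[i:], j) for j, doc in enumerate(docs) for i in range(len(doc))
    )
    suffixes = [s for s, _ in entries]
    A = [j for _, j in entries]
    return suffixes, A


def ranked_listing(suffixes, A, priority, pattern, k):
    """Report the k highest-priority documents that contain the pattern."""
    lo = bisect_left(suffixes, pattern)
    # Every suffix that starts with the pattern sorts before pattern + sentinel.
    hi = bisect_left(suffixes, pattern + "\U0010FFFF")
    distinct = set(A[lo:hi])          # brute-force stand-in for the top-k color query
    return sorted(distinct, key=priority.__getitem__, reverse=True)[:k]


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana"]
    priority = {0: 5, 1: 9, 2: 7}     # assumed document priorities
    suffixes, A = build_index(docs)
    print(ranked_listing(suffixes, A, priority, "ana", 2))
```

The range `A[lo:hi]` plays the role of $A[min_v..max_v]$; the sketch has neither the space nor the time bounds of Corollary 1.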

Ranked t-Mine Problem. Muthukrishnan [15] showed how we can identify all documents that
contain at least $t$ occurrences of a pattern $P$ by reporting all colors in an array $A^t[a_P..b_P]$, where $A^t$
contains $O(N/t)$ elements. That is, for any pattern $P$ we can identify indexes $a_P$ and $b_P$ in $O(|P|)$
time, so that a document contains at least $t$ occurrences of $P$ if and only if the corresponding color
occurs in $A^t[a_P..b_P]$ at least once. Details about the array $A^t$ can be found in [15]. All $A^t$ for all
possible values of $t$ contain $O(\sum_t (N/t)) = O(N \log N)$ elements.

We store the data structure $D^t$ of Theorem 2 for each $A^t$. Using this data structure, we can
report top-$K$ colors in $A^t[a_P..b_P]$. Obviously, this is equivalent to reporting $K$ most highly ranked
documents that contain at least $t$ occurrences of a pattern $P$. Each $D^t$ can be stored in $O((N/t) \log s)$ bits,
where $s$ is the number of documents. Hence, all data structures $D^t$ use $O(N \log s)$ words of $\log N$
bits.
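
The two space bounds follow from the harmonic sum. As a worked step, assuming that $t$ ranges over $1, \ldots, N$ (the text only says "all possible values of $t$"):

```latex
\sum_{t=1}^{N} \frac{N}{t} = N \cdot H_N = O(N \log N),
\qquad
\sum_{t=1}^{N} O\!\left(\frac{N}{t} \log s\right)
  = O(N \log N \log s)\ \text{bits}
  = O(N \log s)\ \text{words of } \log N \text{ bits}.
```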
Corollary 2 There exists a data structure that uses $O(N \log s)$ words, where $s$ is the number of
documents, and supports ranked t-mine queries in $O(|P| + K)$ time.

Most Relevant Documents Problem. Recently, Hon et al. [11] developed a framework for
reporting $K$ most relevant documents. In addition to reporting $K$ most highly ranked documents,
the structure of [11] enables us to report $K$ documents with highest scores, so that the score depends
on both the document $d$ and the pattern $P$. Combining their approach with our data structure,
we can slightly improve their results.

Using the construction of [11], the problem of reporting $K$ most highly scored documents with
respect to a metric $rel(d, P)$ is reduced to a problem of reporting $K$ highest values that occur
in $|P|$ non-overlapping ranges $A[a_1..b_1], A[a_2..b_2], \ldots, A[a_{|P|}..b_{|P|}]$ of the array $A$. The array $A$
of size $O(N)$ contains document identifiers, and every document occurs in $A[a_1..b_1], A[a_2..b_2], \ldots,$
$A[a_{|P|}..b_{|P|}]$ at most once. We refer to [11] for the description of their data structure.

Our improvement is based on storing the array $A$ in the data structure of Theorem 2. At the
beginning of the search procedure, we identify the maximum element in every range $A[a_j..b_j]$ and
store them in a heap data structure (if $|P| > K$, then we store only $K$ largest elements in the heap).
Then, we repeat the following steps $K$ times: (i) we extract the maximum element from the heap
and add it at the end of our list of top documents; (ii) if the element belongs to the range $A[a_j..b_j]$,
we obtain the next highest value in $A[a_j..b_j]$ and add it to the heap. Since the heap contains at most $|P|$
elements, extracting the maximum element from the heap and inserting a new element into the
heap take $O(\log |P|)$ time. If $|P| = \log^{O(1)} N$, then heap operations can be implemented in $O(1)$
time [7, 19]. The data structure of [7] uses multiplications or other time-consuming operations.
In our case, however, all elements stored in the heap are bounded by $N$. Hence, we can replace
each time-consuming operation by $O(1)$ bit operations and look-ups in a table of size $o(N)$ that
can be initialized in $o(N)$ time. As explained in section 8, finding the next largest element in the
range takes $O(1)$ time. Thus the procedure takes $O(|P| + K \log |P|)$ time or $O(|P| + K)$ time if
$|P| = \log^{O(1)} N$.
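
The heap-based procedure can be sketched with Python's binary heap. The sketch assumes that each range can be enumerated in decreasing order; here a sorted copy of the range stands in for the online reporting of section 8, and the function name `top_k_over_ranges` is hypothetical. It uses an ordinary binary heap, so each step costs $O(\log |P|)$ rather than $O(1)$, and it omits the optimization of keeping only the $K$ largest initial elements.

```python
import heapq


def top_k_over_ranges(A, ranges, k):
    """Report the k largest values over several ranges (0-based, inclusive) of A."""
    def descending(a, b):
        # Stand-in for online reporting on A[a..b] in decreasing order.
        return iter(sorted(A[a:b + 1], reverse=True))

    streams = [descending(a, b) for a, b in ranges]
    heap = []
    for j, stream in enumerate(streams):
        first = next(stream, None)            # maximum element of range j
        if first is not None:
            heap.append((-first, j))          # negate: heapq is a min-heap
    heapq.heapify(heap)

    result = []
    while heap and len(result) < k:
        neg_value, j = heapq.heappop(heap)    # (i) extract the maximum
        result.append(-neg_value)
        nxt = next(streams[j], None)          # (ii) next highest value of range j
        if nxt is not None:
            heapq.heappush(heap, (-nxt, j))
    return result


if __name__ == "__main__":
    A = [5, 1, 9, 3, 7, 2, 8]
    print(top_k_over_ranges(A, [(0, 1), (2, 4), (5, 6)], 4))
```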

Corollary 3 There exists an $O(N \log N)$ bits data structure that supports most relevant documents
queries in $O(|P| + K \log |P|)$ time. If $|P| = \log^{O(1)} N$, then queries can be supported in $O(|P| + K)$
time.

10 Conclusions 

In this paper we described a data structure for top-$K$ color reporting with optimal query time.
The worst-case space usage of our data structure is also optimal. This result allows us to report
most highly ranked documents that contain a query pattern $P$ in optimal time using optimal
worst-case space. While the recent compressed data structure of Hon et al. [11] uses less space,
it needs $O(\log N)$ time to report each document. It is an interesting open question whether
we can construct a compressed data structure that requires significantly less time to report every
document.